Bootstrapping a Tagged Corpus through Combination of Existing Heterogeneous Taggers

Authors

Jakub Zavrel

Walter Daelemans

Abstract

This paper describes a new method, COMBI-BOOTSTRAP, to exploit existing taggers and lexical resources for the annotation of corpora with new tagsets. COMBI-BOOTSTRAP uses existing resources as features for a second-level machine learning module that is trained, on a very small sample of annotated corpus material, to make the mapping to the new tagset. Experiments show that COMBI-BOOTSTRAP (i) can integrate a wide variety of existing resources and (ii) achieves much higher accuracy (up to 44.7% error reduction) than both the best single tagger and an ensemble tagger constructed out of the same small training sample.
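
The idea in the abstract, feeding the outputs of existing taggers as features to a second-level learner trained on a small annotated sample, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the class name `SecondLevelTagger`, the 1-nearest-neighbour learner with a feature-overlap metric, the three "existing tagger" outputs, and the toy target tagset are all hypothetical, and the abstract does not say which learner or resources the authors actually used.

```python
from collections import Counter


def overlap(a, b):
    """Count the positions at which two feature tuples agree."""
    return sum(x == y for x, y in zip(a, b))


class SecondLevelTagger:
    """Toy nearest-neighbour learner over the outputs of existing taggers.

    Hypothetical stand-in for the second-level module; not the authors' learner.
    """

    def __init__(self):
        self.examples = []  # list of (feature tuple, tag in the new tagset)

    def fit(self, features, gold_tags):
        # Memorise the small annotated sample.
        self.examples = list(zip(features, gold_tags))
        return self

    def predict_one(self, feats):
        # Find the highest overlap with any stored example ...
        best = max(overlap(feats, f) for f, _ in self.examples)
        # ... and take a majority vote among the examples at that overlap.
        votes = Counter(
            tag for f, tag in self.examples if overlap(feats, f) == best
        )
        return votes.most_common(1)[0][0]

    def predict(self, features):
        return [self.predict_one(f) for f in features]


# Each tuple holds the tags that three (hypothetical) existing taggers,
# each with its own tagset, assigned to one token.
train_features = [
    ("NN", "noun", "N"),
    ("VB", "verb", "V"),
    ("NN", "verb", "N"),
    ("JJ", "adj", "A"),
]
# Hand-annotated tags in the new target tagset for the same tokens.
train_gold = ["NOUN", "VERB", "NOUN", "ADJ"]

model = SecondLevelTagger().fit(train_features, train_gold)
predicted = model.predict([("NN", "noun", "V"), ("JJ", "adj", "N")])
print(predicted)
```

In this sketch the existing taggers may disagree with each other and use incompatible tagsets; the second-level learner only needs their outputs to be consistent enough to predict the new tag.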

Similar resources

Analysis and Development of Urdu POS Tagged Corpus

In this paper, two corpora of Urdu (with 110K and 120K words) tagged with different POS tagsets are used to train TnT and Tree taggers. Error analysis of both taggers is done to identify frequent confusions in tagging. Based on the analysis of tagging, and syntactic structure of Urdu, a more refined tagset is derived. The existing tagged corpora are tagged with the new tagset to develop a singl...

Tagging the Past: Experiments using the Saga Corpus

There is an increasing interest in the NLP community in developing tools for annotating historical data, for example, to facilitate research in the field of corpus linguistics. In this work, we experiment with several PoS taggers using a sub-corpus of the Icelandic Saga Corpus. This is carried out in three main steps. First, we evaluate taggers, which were trained on Modern Icelandic, when tagg...

Improving Tagging Performance by Using Voting Taggers

We present a bootstrapping method to develop an annotated corpus, which is especially useful for languages with few available resources. The method is being applied to develop a corpus of Spanish of over 5Mw. The method consists of taking advantage of the collaboration of two different POS taggers. The cases in which both taggers agree present a higher accuracy and are used to retrain the taggers.

Bootstrapping a Swedish Treebank Using Cross-Corpus Harmonization and Annotation Projection

In this paper, we describe an ongoing project with the aim of bootstrapping a large Swedish treebank, ultimately with a size of about 1.5 million tokens, by reusing two previously existing annotated corpora: an old treebank of about 350,000 tokens and a more recently developed part-of-speech-tagged corpus of about 1.2 million words. A key component in the bootstrapping methodology is the use of...

Part-of-Speech Tagging of Transcribed Speech

We used four Part-of-Speech taggers, which are available for research purposes and were originally trained on text, to tag a corpus of transcribed multiparty spoken dialogues. The assigned tags were then manually corrected. The correction was first used to evaluate the four taggers, then to retrain them. Despite limited resources in time, money and annotators, we reached results comparable to tho...

Journal:

CoRR

Volume: cs.CL/0007018

Publication date: 2000
